Part 1: MAP

Q1) Generating 50 2D data points using y = sin(x^2 + 1): i) The data is generated from 50 equally spaced input values, stored in the array x_arr, using the function y = sin(x^2 + 1). ii) Gaussian noise with mean 0 and standard deviation 0.04 is added to the output.
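The generation step above can be sketched as follows. The input range [0, 1] and the random seed are assumptions, since the text does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (an assumption)

# 50 equally spaced inputs; the interval [0, 1] is an assumption,
# the text does not state the input range.
x_arr = np.linspace(0.0, 1.0, 50)

# Noise-free targets y = sin(x^2 + 1), then Gaussian noise with
# mean 0 and standard deviation 0.04, as described above.
y_clean = np.sin(x_arr ** 2 + 1.0)
t = y_clean + rng.normal(loc=0.0, scale=0.04, size=x_arr.shape)
```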

Fitting the generated data using MAP: i) The value of alpha is set to 0.4. ii) The value of sigma is taken to be 0.04, and beta is therefore computed as 1/sigma^2. The MAP estimate for w, w_map, is then obtained as

wmap.JPG
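A minimal sketch of this estimate, assuming a degree-M polynomial design matrix and the alpha and sigma values stated above (the helper name `map_estimate` and the input range are illustrative):

```python
import numpy as np

def map_estimate(x, t, M, alpha=0.4, sigma=0.04):
    """MAP weights for a degree-M polynomial fit.

    Solves (Phi^T Phi + (alpha/beta) I) w = Phi^T t with beta = 1/sigma^2,
    the closed form for the posterior mode under a zero-mean Gaussian
    prior on w with precision alpha.
    """
    beta = 1.0 / sigma ** 2
    Phi = np.vander(x, M + 1, increasing=True)  # columns 1, x, x^2, ..., x^M
    A = Phi.T @ Phi + (alpha / beta) * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Example usage on synthetic data
x = np.linspace(0, 1, 50)
t = np.sin(x ** 2 + 1)
w_map = map_estimate(x, t, M=10)
```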

Q7) Varying the values of M and alpha: i) The alpha values are varied while M is held constant at 10. ii) M is varied while alpha is held constant at 0.2. The resulting fits are plotted and compared.

Varying alpha and keeping M constant:

Keeping alpha constant and varying M:

Q8) Influence of alpha and M on accuracy: From the plots above, overfitting occurs for small values of alpha at a fixed value of M, and the fit improves as alpha increases. Conversely, when alpha is held constant and M is varied, underfitting occurs for small values of M (up to M = 2), beyond which the curve fits correctly in the range of roughly 3 to 20. However, in both cases, a further increase in the value of either hyperparameter again leads to overfitting.

Part 2: Basis Function

1) Generating the data

The basis functions are generated using the formulae shown below: gaussian_basis.JPG

sigmoid_basis.JPG

Generating sample gaussian basis function:
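The two basis functions can be sketched as follows, using the standard Gaussian and logistic-sigmoid forms; the centre placement and widths here are illustrative:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2.0 * s ** 2))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

# Example: 9 basis functions with centres spread over the input range
x = np.linspace(0, 1, 50)
mu = np.linspace(0, 1, 9)
Phi_g = gaussian_basis(x, mu, s=0.3)
Phi_s = sigmoid_basis(x, mu, s=0.3)
```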

2) Fitting the data using MLE: Here the dimensionality M is taken to be 9, while the sigma value of the Gaussian basis function is taken to be 0.3. The parameter w obtained through MLE is estimated using the equation shown in the image below: w_mle.JPG
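The MLE solution is the usual normal-equations estimate w = (Phi^T Phi)^{-1} Phi^T t; a sketch with the M = 9, sigma = 0.3 setting mentioned above (the input range and centre placement are assumptions):

```python
import numpy as np

def mle_weights(Phi, t):
    """Maximum-likelihood weights w = (Phi^T Phi)^{-1} Phi^T t,
    computed via the pseudo-inverse for numerical stability."""
    return np.linalg.pinv(Phi) @ t

def gaussian_basis(x, mu, s):
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2.0 * s ** 2))

# Example with M = 9 Gaussian basis functions of width 0.3
x = np.linspace(0, 1, 50)
t = np.sin(x ** 2 + 1)
mu = np.linspace(0, 1, 9)
Phi = gaussian_basis(x, mu, s=0.3)
w_mle = mle_weights(Phi, t)
```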

3) Generating a sample overfitting curve for M=200 and sigma=0.01: From the plot below, it can be seen that the curve overfits, passing through all the noisy points.

3) Generating a sample underfitting curve for M=2 and sigma=0.5 using the Gaussian basis function: From the plot below, it can be seen that the curve underfits.

3) Generating a sample overfitting curve for M=100 and sigma=0.01 using the sigmoid basis function: From the plot below, it can be seen that the curve overfits, passing through all the noisy points.

3) Generating a sample underfitting curve for M=2 and sigma=0.5 using the sigmoid basis function: From the plot below, it can be seen that the curve underfits.

4) i) Varying M value to generate overfitting and underfitting curves for gaussian basis function:

5) Varying M value to generate overfitting and underfitting curves for sigmoid basis function:

6) Generating the function y = 0.4345*np.power(x_arr,3) - 5.607*np.power(x_arr,2) + 16.78*x_arr - 10.61: i) The input data consists of 50 evenly spaced points between 0 and 8.5. ii) The noise added to the output is Gaussian with mean 0 and a sigma value of 2.
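This generation step can be sketched as below. The random seed is an assumption, and the text's "sigma value of 2" is used here directly as the noise scale passed to the sampler:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption)

# 50 evenly spaced inputs in [0, 8.5], as described above
x_arr = np.linspace(0.0, 8.5, 50)

# y = 0.4345 x^3 - 5.607 x^2 + 16.78 x - 10.61
y_clean = (0.4345 * np.power(x_arr, 3)
           - 5.607 * np.power(x_arr, 2)
           + 16.78 * x_arr
           - 10.61)

# Gaussian noise with mean 0 and scale sigma = 2
t = y_clean + rng.normal(0.0, 2.0, size=x_arr.shape)
```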

Obtaining the best-fit curve using the sigmoid basis function: M is set to 100, while the sigma of the sigmoid basis function is set to 1.5.

7) i) Varying parameters of Gaussian basis function:

Varying M and keeping sigma constant for gaussian basis:

Keeping M constant and varying sigma for gaussian basis:

i) From the plots above, it can be seen that when M is varied with sigma held constant, underfitting occurs for small values of M, beyond which the fit becomes correct. ii) When M is held constant and the sigma of the Gaussian basis function is varied, the curve fits correctly at lower values of sigma but underfits at higher values.

Varying parameters of sigmoid basis function:

Keeping sigma constant and varying M values:

Varying sigma values and keeping M constant for sigmoid basis function:

i) From the plots above, it can be seen that when sigma is held constant and M is varied for the sigmoid basis function, the curve underfits at lower values of M, beyond which it begins to fit correctly. ii) When sigma is varied and M is held constant, the curve underfits at lower values of sigma and then fits correctly at higher values of sigma.

Polynomial basis functions place no bound on the feature values obtained for large M: for very large M, the feature x_n^M can become extremely large, which can in turn make it difficult to obtain a stable solution when fitting the curve. Basis functions such as the Gaussian and sigmoid basis functions keep their outputs bounded even for large M, as a result of which fitting behaves better. Polynomial functions are therefore more prone to overfitting, while such basis functions can help generalize better. Polynomial functions are also global, so changes in one region of input space affect all other regions, whereas Gaussian and sigmoid basis functions are local: changes in the input affect only the surrounding region rather than the entire space. Finally, polynomial functions fit steadily increasing or decreasing trends well, while local basis functions fit better to non-linear functions that are not monotonically increasing or decreasing.

Part 3: Full Bayesian + Predictive Distribution

1) Generating the data:

2) Displaying best fitting curve and the estimated values of w:
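The estimated w and the best-fit curve come from the Gaussian posterior over w. A sketch of the standard closed form, assuming a polynomial design matrix and illustrative alpha/beta values (the basis and hyperparameters used in the actual plots may differ):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior over w: mean m_N and covariance S_N, with
    S_N^{-1} = alpha I + beta Phi^T Phi and m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Example with a degree-9 polynomial design matrix
x = np.linspace(0, 1, 50)
t = np.sin(x ** 2 + 1)
Phi = np.vander(x, 10, increasing=True)
m_N, S_N = posterior(Phi, t, alpha=0.2, beta=1.0 / 0.04 ** 2)
```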

Varying alpha and beta terms to observe variations:

From the plots above, obtained by varying the values of alpha and beta, it can be seen that decreasing beta and increasing alpha both increase underfitting, while for small values of alpha and high values of beta the curve fits correctly. A higher value of alpha corresponds to a higher prior precision, i.e. a narrower prior distribution over w, which pulls the weights toward zero and can therefore lead to underfitting. A decrease in the value of beta implies greater noise variance and therefore also leads to underfitting. Together, alpha and beta form the regularization parameter that can in turn be used to prevent overfitting.

4) Generating Best Fit Curve using Gaussian Basis Function:

Q5) The use of p(w|t) in the training and test stages: i) The posterior probability p(w|t) takes into account both the likelihood and the prior. Since prior information about w, in the form of the prior probability p(w), is incorporated, the resulting inference changes accordingly. ii) The posterior also provides uncertainty information about the parameter w, which can in turn help improve training and testing performance. iii) During testing, the predictive probability p(t*|t), where t* is the prediction for an incoming test input x* and t is the vector of training targets, is obtained through the posterior probability p(w|t). This predictive distribution gives both the most probable value and the uncertainty associated with each prediction.

7) Implementing Sequential Learning:
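One way to sketch the sequential updates, assuming polynomial features and illustrative alpha/beta values: the posterior after each data point becomes the prior for the next.

```python
import numpy as np

def sequential_update(m, S, phi_n, t_n, beta):
    """One sequential Bayesian update: the current posterior (m, S)
    acts as the prior for the next data point (phi_n, t_n)."""
    S_new_inv = np.linalg.inv(S) + beta * np.outer(phi_n, phi_n)
    S_new = np.linalg.inv(S_new_inv)
    m_new = S_new @ (np.linalg.inv(S) @ m + beta * phi_n * t_n)
    return m_new, S_new

# Start from the prior N(0, alpha^{-1} I) and feed points one at a time
alpha, beta, M = 0.2, 25.0, 4
m = np.zeros(M)
S = np.eye(M) / alpha

x = np.linspace(0, 1, 50)
t = np.sin(x ** 2 + 1)
for x_n, t_n in zip(x, t):
    phi_n = x_n ** np.arange(M)  # polynomial features 1, x, x^2, x^3
    m, S = sequential_update(m, S, phi_n, t_n, beta)
```

After all 50 updates, (m, S) matches the batch posterior computed from the full design matrix, which is a useful sanity check.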

From the sequential learning plots above, it can be observed that the curve fits better as more data points are added. The variance of the predictive distribution also decreases as points are added sequentially, indicating greater certainty in the predictions.

Q8) i) In full Bayesian inference, the uncertainty S_N is used to form the posterior probability p(w|t), whereas the predictive distribution p(t*|t) is computed from the posterior p(w|t) and the likelihood p(t*|w) in order to relate the target t* of a test input to the training targets t. ii) The predictive distribution makes it possible to interpolate or extrapolate predictions of t from the training data and to quantify their uncertainty, which makes it more beneficial than the posterior alone. The posterior probability computed in full Bayesian inference provides an estimate of the probable values of the parameter w, but does not by itself say how predictions made with that posterior will turn out.

Q9) The predictive distribution p(t_new|t) provides a probability estimate of predicted values, along with their uncertainty, using the Gaussian likelihood p(t*|w) and the posterior distribution p(w|t). It is used to infer the most probable target value for the new test input provided. The probability p(w|t), on the other hand, represents the posterior probability of the parameter w given the observed training targets t, obtained from the likelihood p(t|w) and the prior p(w). This posterior can then be used to obtain the predictive distribution for the test inputs.

Part 4: Classification using Full Bayesian + Predictive Distribution

1) Generating the data:

Initializing the components required to obtain the optimal parameter w: i) The initial value of w is set to 0, while the covariance of the prior distribution is taken to be (1/alpha)*I. ii) The target labels are concatenated into a 1-D array, where inputs belonging to class 0 are labelled 0 and inputs belonging to class 1 are labelled 1.

Iterating through Newton's method using Hessian Matrix to obtain the optimal value for w parameter:
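A sketch of the Newton (IRLS) iteration described above, assuming a Gaussian prior with precision alpha; the two-blob data, the alpha value, and the iteration count are illustrative:

```python
import numpy as np

def sigmoid(a):
    # clip to avoid overflow in exp for large |a|
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def newton_logistic(Phi, t, alpha=0.1, n_iter=25):
    """MAP logistic-regression weights via Newton's method.

    gradient = Phi^T (y - t) + alpha * w
    Hessian  = Phi^T R Phi + alpha * I,  R = diag(y * (1 - y))
    """
    N, M = Phi.shape
    w = np.zeros(M)  # initial w = 0, as in the text
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + alpha * w
        R = np.diag(y * (1.0 - y))
        H = Phi.T @ R @ Phi + alpha * np.eye(M)
        w = w - np.linalg.solve(H, grad)
    return w

# Example: two 2-D Gaussian blobs with a bias feature
rng = np.random.default_rng(0)
X0 = rng.normal([-1.5, -1.5], 0.5, size=(50, 2))  # class 0
X1 = rng.normal([1.5, 1.5], 0.5, size=(50, 2))    # class 1
X = np.vstack([X0, X1])
t = np.concatenate([np.zeros(50), np.ones(50)])   # labels as described above
Phi = np.hstack([np.ones((100, 1)), X])           # bias + raw features
w = newton_logistic(Phi, t)
```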

Calculating the prediction probability for input data

3) Generating the probability map on test data: i) The learned parameter w is used to compute prediction probabilities for the 2-dimensional test inputs provided. The probability map is then generated from the resulting predictive distribution.
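One common way to compute such a map is the probit-approximated predictive distribution for Bayesian logistic regression; the grid, weights, and posterior covariance below are purely illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def predictive_prob(Phi_test, w_map, S_N):
    """Approximate Bayesian predictive probability of class 1.

    With mu_a = phi^T w and sigma_a^2 = phi^T S_N phi, the standard
    probit approximation gives p(C1 | phi) ~= sigmoid(kappa * mu_a),
    where kappa = (1 + pi * sigma_a^2 / 8)^(-1/2).
    """
    mu_a = Phi_test @ w_map
    sigma_a2 = np.einsum('ij,jk,ik->i', Phi_test, S_N, Phi_test)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return sigmoid(kappa * mu_a)

# Example probability map over a small 2-D grid (with a bias feature)
xx, yy = np.meshgrid(np.linspace(-3, 3, 25), np.linspace(-3, 3, 25))
Phi_test = np.column_stack([np.ones(xx.size), xx.ravel(), yy.ravel()])
w_map = np.array([0.0, 1.0, 1.0])  # illustrative weights
S_N = 0.1 * np.eye(3)              # illustrative posterior covariance
prob_map = predictive_prob(Phi_test, w_map, S_N).reshape(xx.shape)
```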

4) Predicting a custom input from the user (kindly enter in this format: num1 num2)